Sequence-to-sequence translation from mass spectra to peptides with a transformer model

A fundamental challenge in mass spectrometry-based proteomics is the identification of the peptide that generated each acquired tandem mass spectrum. Approaches that leverage known peptide sequence databases cannot detect unexpected peptides and can be impractical or impossible to apply in some settings. Thus, the ability to assign peptide sequences to tandem mass spectra without prior information—de novo peptide sequencing—is valuable for tasks including antibody sequencing, immunopeptidomics, and metaproteomics. Although many methods have been developed to address this problem, it remains an outstanding challenge in part due to the difficulty of modeling the irregular data structure of tandem mass spectra. Here, we describe Casanovo, a machine learning model that uses a transformer neural network architecture to translate the sequence of peaks in a tandem mass spectrum into the sequence of amino acids that comprise the generating peptide. We train a Casanovo model from 30 million labeled spectra and demonstrate that the model outperforms several state-of-the-art methods on a cross-species benchmark dataset. We also develop a version of Casanovo that is fine-tuned for non-enzymatic peptides. Finally, we demonstrate that Casanovo’s superior performance improves the analysis of immunopeptidomics and metaproteomics experiments and allows us to delve deeper into the dark proteome.

Table S3: The most common amino acid swaps have positive BLOSUM scores. Among the single amino acid substitutions detected by Casanovo that can be explained by a single nucleotide polymorphism, the ten most common are enriched for positive BLOSUM scores.

Figure S1: Casanovo bm outperforms Novor, DeepNovo, and PointNovo on the original nine-species benchmark. Each panel evaluates the peptide-level performance on the held-out species in the nine-species benchmark. For Casanovo bm, all peptides that pass the precursor m/z filter are ranked above peptides that do not pass the filter, and the boundary is indicated by a diamond on the curve.

Figure S2: Casanovo outperforms Novor, DeepNovo, PointNovo, and Casanovo bm on the revised nine-species benchmark. Each panel evaluates the peptide-level performance on the held-out species in the nine-species benchmark. For Casanovo and Casanovo bm, all peptides that pass the precursor m/z filter are ranked above peptides that do not pass the filter, and the boundary is indicated by a diamond on each curve.
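The ranking rule described in these captions can be illustrated with a minimal sketch. It assumes that each prediction carries a confidence score, a flag for whether it passes the precursor m/z filter, and a correctness label, and that precision and coverage are computed cumulatively down the ranked list. The function name `precision_coverage` and the tuple layout are hypothetical and are not taken from the Casanovo code base.

```python
def precision_coverage(predictions):
    """Rank predictions and trace a precision-coverage curve.

    predictions: list of (score, passes_filter, is_correct) tuples
    (a hypothetical layout chosen for this illustration).
    Returns the curve as (coverage, precision) points and the number of
    predictions that pass the filter, which marks the boundary position.
    """
    # Predictions that pass the precursor m/z filter are ranked first;
    # within each group, higher scores rank higher.
    ranked = sorted(predictions, key=lambda p: (p[1], p[0]), reverse=True)
    total = len(ranked)
    curve = []
    n_correct = 0
    for k, (_, _, is_correct) in enumerate(ranked, start=1):
        n_correct += int(is_correct)
        # Assumed definitions: coverage = fraction of spectra considered,
        # precision = fraction of those that are correct.
        curve.append((k / total, n_correct / k))
    n_passing = sum(1 for p in ranked if p[1])
    return curve, n_passing


toy = [
    (0.90, True, True),
    (0.95, False, False),  # high score, but fails the filter
    (0.50, True, False),
    (0.80, False, True),
]
curve, n_passing = precision_coverage(toy)
for coverage, precision in curve:
    print(f"coverage={coverage:.2f} precision={precision:.2f}")
print("boundary after rank", n_passing)
```

In this toy input the prediction with the highest score fails the filter, so it is ranked third rather than first. The point at rank `n_passing` is where a diamond would be drawn.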

Figure S5: Casanovo performance on the nine-species benchmark improves with more training data. Each point corresponds to a Casanovo model trained on one of the nested subsets of MassIVE-KB, ranging from 250,000 spectra to the full dataset of 28 million spectra. Average precision is reported on the revised nine-species benchmark.

Figure S7: Casanovo outperforms Novor, DeepNovo, PointNovo, and Casanovo bm at the amino acid level on the nine-species benchmark. Each panel evaluates the amino acid-level performance on the held-out species in the nine-species benchmark. For Casanovo and Casanovo bm, all peptides that pass the precursor m/z filter are ranked above peptides that do not pass the filter, and the boundary is indicated by a diamond on each curve.

Figure S9: Breakdown of de novo sequencing performance by charge state. Plots show peptide precision-coverage curves for subsets of the revised nine-species benchmark, grouped by charge state. Panels correspond to spectra with (A) 2+ charge, (B) 3+ charge, and (C) 4+ or higher charge.

Figure S11: Casanovo identifies a greater number of immunopeptides than Tide database search. The plot shows the overlap between the unique peptides assigned by Casanovo that match the human proteome and those assigned by Tide at 1% FDR for the immunopeptidomics dataset.

Figure S15: Sinusoidal encodings represent m/z distance between peaks in a mass spectrum. (A) The m/z value of each peak is encoded from a progression of sinusoids defined by a minimum and maximum wavelength. To demonstrate how the encoding is performed, this example creates a 6-dimensional embedding (d = 6) of m/z 1.0 and m/z 3 from sinusoids ranging in wavelength from m/z 1 (λ min = 1) to m/z 10 (λ max = 10). (B) The Casanovo sinusoidal embeddings are 512-dimensional (d = 512) and created from sinusoids ranging in wavelength from m/z 0.0001 (λ min = 0.0001) to m/z 10,000 (λ max = 10,000). The utility of these embeddings lies in their preservation of m/z distance in the embedded space. Here, we sample 10,000 pairs of m/z values between m/z 0 and 2000. The cosine similarity between the embeddings of each pair is negatively correlated with the original distance between the m/z values.
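The encoding in Figure S15 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Casanovo implementation: it assumes that the wavelengths are spaced geometrically between λ min and λ max, that the first d/2 features are sines and the last d/2 are cosines, and it samples 200 pairs rather than 10,000 to keep the pure-Python run short. The function names are hypothetical.

```python
import math
import random


def mz_encoding(mz, d=512, lam_min=1e-4, lam_max=1e4):
    """Encode one m/z value as d/2 sine and d/2 cosine features."""
    half = d // 2
    sines, cosines = [], []
    for i in range(half):
        # Assumed spacing: wavelengths grow geometrically from lam_min to lam_max.
        frac = i / (half - 1) if half > 1 else 0.0
        wavelength = lam_min * (lam_max / lam_min) ** frac
        angle = 2.0 * math.pi * mz / wavelength
        sines.append(math.sin(angle))
        cosines.append(math.cos(angle))
    return sines + cosines


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Panel A analogue: a 6-dimensional embedding with wavelengths from 1 to 10.
print(mz_encoding(1.0, d=6, lam_min=1.0, lam_max=10.0))

# Panel B analogue: compare embedding similarity with m/z distance.
rng = random.Random(0)
pairs = [(rng.uniform(0, 2000), rng.uniform(0, 2000)) for _ in range(200)]
distances = [abs(a - b) for a, b in pairs]
similarities = [cosine_similarity(mz_encoding(a), mz_encoding(b)) for a, b in pairs]
print("Pearson correlation:", pearson(distances, similarities))
```

With this pairing of sine and cosine features, the dot product of two embeddings reduces to a sum of cos(2πΔ/λ) terms, where Δ is the m/z difference, so the similarity depends only on the distance between the two m/z values and not on their absolute positions.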

Table S1: Creating a non-enzymatic dataset by sampling from PROSPECT and MassIVE-KB. PROSPECT was first downsampled to include at most 100 PSMs per peptide sequence. MassIVE-KB and PROSPECT were then segregated by C-terminal amino acid, and we randomly selected PSMs in each category from MassIVE-KB, supplementing as necessary from PROSPECT to obtain 50,000 PSMs per terminal amino acid.
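The sampling procedure for the non-enzymatic dataset can be sketched as follows. This is a minimal illustration, assuming each PSM is a dictionary with a `"peptide"` key holding an unmodified sequence string, so that the last character is the C-terminal amino acid. The function names, the record layout, and the fixed random seed are hypothetical and do not come from the original pipeline.

```python
import random
from collections import defaultdict


def cap_per_peptide(psms, max_per_peptide, rng):
    """Keep at most max_per_peptide randomly chosen PSMs per peptide sequence."""
    by_peptide = defaultdict(list)
    for psm in psms:
        by_peptide[psm["peptide"]].append(psm)
    kept = []
    for group in by_peptide.values():
        if len(group) > max_per_peptide:
            group = rng.sample(group, max_per_peptide)
        kept.extend(group)
    return kept


def group_by_c_terminus(psms):
    """Group PSMs by the last residue of the peptide (assumes no modification suffix)."""
    groups = defaultdict(list)
    for psm in psms:
        groups[psm["peptide"][-1]].append(psm)
    return groups


def build_dataset(massive_kb, prospect, per_residue=50_000, max_per_peptide=100, seed=0):
    rng = random.Random(seed)
    # Step 1: downsample PROSPECT to at most max_per_peptide PSMs per peptide.
    prospect = cap_per_peptide(prospect, max_per_peptide, rng)
    # Step 2: segregate both sources by C-terminal amino acid.
    primary = group_by_c_terminus(massive_kb)
    backup = group_by_c_terminus(prospect)
    dataset = {}
    for residue in sorted(set(primary) | set(backup)):
        # Step 3: draw from MassIVE-KB first, then fill any shortfall from PROSPECT.
        pool = primary.get(residue, [])
        chosen = rng.sample(pool, min(per_residue, len(pool)))
        shortfall = per_residue - len(chosen)
        extra = backup.get(residue, [])
        chosen += rng.sample(extra, min(shortfall, len(extra)))
        dataset[residue] = chosen
    return dataset


# Toy usage with tiny quotas in place of 50,000 and 100.
massive_kb = [{"peptide": "PEPTIDEK", "source": "MassIVE-KB"} for _ in range(3)]
massive_kb.append({"peptide": "SAMPLEF", "source": "MassIVE-KB"})
prospect = [{"peptide": "LLLLF", "source": "PROSPECT"} for _ in range(10)]
dataset = build_dataset(massive_kb, prospect, per_residue=3, max_per_peptide=2)
for residue, psms in dataset.items():
    print(residue, [psm["source"] for psm in psms])
```

If both sources together hold fewer PSMs than the quota for a residue, this sketch simply returns what is available; how the original procedure handled that case is not stated in the caption.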